An Analysis of How AutoTokenizer and AutoModel Work in transformers (Part 1)
Let's dissect AutoTokenizer and AutoModel in transformers through a sentiment-analysis example:
Inspecting in ipython what the Hugging Face sentiment-analysis pipeline depends on, the output shows that pipeline mode uses the pretrained checkpoint distilbert-base-uncased-finetuned-sst-2-english.

If test_sentences = ("today is not that bad", "today is so bad", "so good") contains three elements, then

inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")

raises an error, because encode_plus() only accepts a single sentence or a sentence pair (a tuple of two elements). Calling tokenizer() directly, however, works without error.

Looking at the underlying code:

- when the input is a single string, tokenizer() is equivalent to tokenizer.encode_plus();
- when the input is a list or tuple, tokenizer() is equivalent to tokenizer.batch_encode_plus().

padding=True: with test_sentences = ("today is not that bad", "today is so bad", "so good"), when the sentences have different lengths, each one is padded with [PAD] (i.e., zeros) up to the length of the longest sentence in the batch.

padding="max_length": usually combined with an explicit max_length=XXXX argument; every sentence is padded with zeros up to max_length.

The return values of tokenizer.batch_encode_plus() / tokenizer.encode_plus():

- input_ids: the vocabulary ids the tokenizer assigns to each token;
- attention_mask: 1 marks real (un-padded) tokens, 0 marks padded tokens.

The lower-level equivalent of tokenizer.batch_encode_plus() / tokenizer.encode_plus() is:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0]))

(note that encode_plus() additionally inserts the special tokens [CLS] and [SEP]).

The model used here is AutoModelForSequenceClassification. The returned logits is a tensor of shape (3, 2); taking the argmax along the last dimension gives [1, 0, 1]. What does [1, 0, 1] mean? Looking at the model's config (model.config.id2label), 1 stands for POSITIVE and 0 for NEGATIVE.

The complete code:

```python
# ---encoding:utf-8---
# @Time    : 2023/8/1 10:54
# @Author  : CBAiotAigc
# @Email   : [email protected]
# @Site    :
# @File    : tokenizer_sentiment_analysis.py
# @Project : AI_Review
# @Software: PyCharm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer and the sequence-classification model from the local checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

test_sentences = ["today is not that bad", "today is so bad", "so good"]

# encode_plus() fails on a list of three sentences (it expects one sentence or a pair):
# inputs_tensor = tokenizer.encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
# print(inputs_tensor)

# tokenizer() dispatches to batch_encode_plus() for a list input.
inputs_tensor = tokenizer(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

inputs_tensor = tokenizer.batch_encode_plus(test_sentences, padding=True, truncation=True, return_tensors="pt")
print(inputs_tensor)

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs_tensor)
print(outputs)

# logits has shape (3, 2); argmax over the last dimension gives the predicted label ids.
labels = torch.argmax(outputs.logits, dim=-1)
print(labels)

print(model.config.id2label)
print([model.config.id2label[id] for id in labels.tolist()])
```
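To make the equivalences above easy to verify, here is a minimal sketch. It assumes the same local model path as in the complete code (adjust as needed); the pipeline call at the end is just the high-level shortcut mentioned at the beginning, not part of the original script.

```python
# A quick check of the tokenizer/model behaviour described above.
# Assumption: the local checkpoint path below exists, as in the full example.
from transformers import AutoTokenizer, pipeline

model_name = "../model/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = "today is not that bad"

# For a single string, tokenizer(...) and tokenizer.encode_plus(...) agree.
print(dict(tokenizer(sentence)) == dict(tokenizer.encode_plus(sentence)))  # True

# tokenize + convert_tokens_to_ids omits the special tokens [CLS]/[SEP];
# encode_plus adds them.
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence)))
print(tokenizer.encode_plus(sentence)["input_ids"])

# padding="max_length" pads every sentence with [PAD] up to max_length.
batch = tokenizer(["today is so bad", "so good"],
                  padding="max_length", max_length=10, truncation=True)
print(batch["input_ids"])
print(batch["attention_mask"])  # 1 = real token, 0 = padding

# The same checkpoint behind the high-level pipeline API.
clf = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)
print(clf(["today is not that bad", "today is so bad", "so good"]))
```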